The Home Equity dataset (HMEQ) contains baseline and loan performance information for 5,960 recent home equity loans. The target (BAD) is a binary variable indicating whether an applicant ultimately defaulted or was severely delinquent. This adverse outcome occurred in 1,189 cases (20 percent). Twelve input variables were recorded for each applicant.
BAD: 1 = Client defaulted on loan, 0 = loan repaid
LOAN: Amount of loan approved.
MORTDUE: Amount due on the existing mortgage.
VALUE: Current value of the property.
REASON: Reason for the loan request (HomeImp = home improvement; DebtCon = debt consolidation, i.e., taking out a new loan to pay off other liabilities and consumer debts).
JOB: The applicant's job category (Mgr, Office, Other, ProfExe, Sales, or Self).
YOJ: Years at present job.
DEROG: Number of major derogatory reports (which indicates a serious delinquency or late payments).
DELINQ: Number of delinquent credit lines (a line of credit becomes delinquent when a borrower does not make the minimum required payments 30 to 60 days past the day on which the payments were due).
CLAGE: Age of the oldest credit line in months.
NINQ: Number of recent credit inquiries.
CLNO: Number of existing credit lines.
DEBTINC: Debt-to-income ratio (all monthly debt payments divided by gross monthly income; one way lenders measure an applicant's ability to manage the monthly payments on the money they plan to borrow).
# data handling & viz
import numpy as np # for treating outliers
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# suppressing warnings
import warnings
warnings.filterwarnings("ignore")
# data prep
from sklearn.model_selection import train_test_split
# scaling data before modeling
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# logistic regression classifier
from sklearn.linear_model import LogisticRegression
# to create classification trees
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# visualize trees
from sklearn import tree
# tune parameters
from sklearn.model_selection import GridSearchCV
# SVM models
from sklearn.svm import SVC
# gradient boosting tree classifier
from xgboost.sklearn import XGBClassifier
# metrics to assess model performance
from sklearn.metrics import (
f1_score, # balance of recall & precision (harmonic mean)
accuracy_score, # % correct (all predictions)
recall_score, # ability to find relevant cases in dataset
precision_score, # ability to find only relevant data
confusion_matrix,
classification_report,
make_scorer, # for scoring models when running grid search
precision_recall_curve
)
# read in data and view first 2 rows to verify
df = pd.read_csv("C:/Users/jeske/Documents/MIT Applied Data Sci/MIT ADSP Capstone Projects/Practical DS and Classification/hmeq.csv")
df.head(2)
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | NaN |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | NaN |
# check how much data there is
df.shape
(5960, 13)
# check out columns, data types, presence of missing data
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5960 entries, 0 to 5959
Data columns (total 13 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   BAD      5960 non-null   int64
 1   LOAN     5960 non-null   int64
 2   MORTDUE  5442 non-null   float64
 3   VALUE    5848 non-null   float64
 4   REASON   5708 non-null   object
 5   JOB      5681 non-null   object
 6   YOJ      5445 non-null   float64
 7   DEROG    5252 non-null   float64
 8   DELINQ   5380 non-null   float64
 9   CLAGE    5652 non-null   float64
 10  NINQ     5450 non-null   float64
 11  CLNO     5738 non-null   float64
 12  DEBTINC  4693 non-null   float64
dtypes: float64(9), int64(2), object(2)
memory usage: 605.4+ KB
Observations: All columns except BAD and LOAN contain missing values; DEBTINC has the most (1,267 missing). REASON and JOB are the only categorical (object) columns.
# Get fraction of missing data in each column
df.isna().sum().sort_values(ascending = False)/len(df)
DEBTINC    0.212584
DEROG      0.118792
DELINQ     0.097315
MORTDUE    0.086913
YOJ        0.086409
NINQ       0.085570
CLAGE      0.051678
JOB        0.046812
REASON     0.042282
CLNO       0.037248
VALUE      0.018792
BAD        0.000000
LOAN       0.000000
dtype: float64
# check for duplicates
df.duplicated().sum()
0
Observation: No duplicates
# basic stats for numeric columns
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| BAD | 5960.0 | 0.199497 | 0.399656 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| LOAN | 5960.0 | 18607.969799 | 11207.480417 | 1100.000000 | 11100.000000 | 16300.000000 | 23300.000000 | 89900.000000 |
| MORTDUE | 5442.0 | 73760.817200 | 44457.609458 | 2063.000000 | 46276.000000 | 65019.000000 | 91488.000000 | 399550.000000 |
| VALUE | 5848.0 | 101776.048741 | 57385.775334 | 8000.000000 | 66075.500000 | 89235.500000 | 119824.250000 | 855909.000000 |
| YOJ | 5445.0 | 8.922268 | 7.573982 | 0.000000 | 3.000000 | 7.000000 | 13.000000 | 41.000000 |
| DEROG | 5252.0 | 0.254570 | 0.846047 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| DELINQ | 5380.0 | 0.449442 | 1.127266 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 |
| CLAGE | 5652.0 | 179.766275 | 85.810092 | 0.000000 | 115.116702 | 173.466667 | 231.562278 | 1168.233561 |
| NINQ | 5450.0 | 1.186055 | 1.728675 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 17.000000 |
| CLNO | 5738.0 | 21.296096 | 10.138933 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 71.000000 |
| DEBTINC | 4693.0 | 33.779915 | 8.601746 | 0.524499 | 29.140031 | 34.818262 | 39.003141 | 203.312149 |
Observations: Most numeric variables are right-skewed, with means above medians and long upper tails. A few maxima look implausible — e.g., CLAGE of 1,168 months (about 97 years) and DEBTINC above 200.
# Taking a look at the 2 categorical variables
df.describe(include=(['object'])).T
| | count | unique | top | freq |
|---|---|---|---|---|
| REASON | 5708 | 2 | DebtCon | 3928 |
| JOB | 5681 | 6 | Other | 2388 |
# look at distribution of data within each categorical variable
cat_col = ['BAD', 'REASON', 'JOB']
for column in cat_col:
    print(df[column].value_counts(normalize=True))
    print("-" * 50)
BAD
0    0.800503
1    0.199497
Name: proportion, dtype: float64
--------------------------------------------------
REASON
DebtCon    0.688157
HomeImp    0.311843
Name: proportion, dtype: float64
--------------------------------------------------
JOB
Other      0.420349
ProfExe    0.224608
Office     0.166872
Mgr        0.135011
Self       0.033973
Sales      0.019187
Name: proportion, dtype: float64
--------------------------------------------------
Observations:
More than 2/3 of loans are for debt consolidation. The rest, about 31%, are for home improvement.
Leading Questions:
Next, look at the distribution of values for each individual variable across the dataset — the shape of the distribution and any outliers.
# distribution plots of categorical variables (excl target var)
fig, axs = plt.subplots(ncols=2, figsize=(10,4))
sns.histplot(df, x='JOB', ax=axs[0])
sns.histplot(df, x='REASON', ax=axs[1])
plt.show()
def histogram_boxplot(data, feature, figsize=(10, 6), kde=False):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (10,6))
kde: whether to show the density curve (default False)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows = 2, # Number of rows of the subplot grid = 2
sharex = True, # x-axis will be shared among all subplots
gridspec_kw = {"height_ratios": (0.2, 0.8)},
figsize = figsize,
) # Creating the 2 subplots
sns.boxplot(data = data, x = feature, ax = ax_box2, showmeans = True, color = "skyblue")
# Boxplot will be created and a star will indicate the mean value of the column
sns.histplot(data = data, x = feature, kde = kde, ax = ax_hist2) # For histogram
ax_hist2.axvline(data[feature].mean(), color = "green", linestyle = "--") # Add mean to the histogram
ax_hist2.axvline(data[feature].median(), color = "black", linestyle = "-") # Add median to the histogram
# begin plotting all of the numeric variable distributions
histogram_boxplot(df, 'LOAN', kde=True)
histogram_boxplot(df, 'MORTDUE', kde=True)
histogram_boxplot(df, 'VALUE', kde=True)
Observations:
histogram_boxplot(df, 'YOJ')
Observations:
histogram_boxplot(df, 'DEROG')
histogram_boxplot(df, 'DELINQ')
histogram_boxplot(df, 'CLAGE', kde=True)
Observations:
df[df.CLAGE > 1100]
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3097 | 1 | 16800 | 87300.0 | 155500.0 | DebtCon | Other | 3.0 | 0.0 | 0.0 | 1154.633333 | 0.0 | 0.0 | NaN |
| 3679 | 1 | 19300 | 96454.0 | 157809.0 | DebtCon | Other | 3.0 | 0.0 | 0.0 | 1168.233561 | 0.0 | 0.0 | 40.206138 |
Observations: Both borrowers with CLAGE above 1,100 months also have CLNO = 0 and defaulted (BAD = 1); these records look like data-entry anomalies.
# force display of all rows
pd.set_option('display.max_rows', None)
df[df.CLNO==0]
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21 | 1 | 2400 | 50000.0 | 73395.0 | HomeImp | ProfExe | 5.0 | 1.0 | 0.0 | NaN | 1.0 | 0.0 | NaN |
| 92 | 0 | 4000 | NaN | 45760.0 | HomeImp | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 118 | 0 | 4500 | NaN | 49044.0 | HomeImp | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 11.652739 |
| 220 | 0 | 5300 | NaN | 49396.0 | HomeImp | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 12.043671 |
| 298 | 0 | 5900 | NaN | 51189.0 | HomeImp | NaN | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 12.749181 |
| 329 | 0 | 6000 | NaN | 53190.0 | HomeImp | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 15.174415 |
| 341 | 0 | 6100 | NaN | 46830.0 | HomeImp | NaN | 0.0 | 0.0 | 1.0 | NaN | 0.0 | 0.0 | 13.306013 |
| 418 | 0 | 6600 | NaN | 48800.0 | HomeImp | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 12.219436 |
| 422 | 0 | 6600 | NaN | 46516.0 | HomeImp | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 14.845991 |
| 552 | 0 | 7400 | NaN | 54138.0 | HomeImp | NaN | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 12.219680 |
| 578 | 1 | 7500 | NaN | 40150.0 | HomeImp | Other | 5.0 | 0.0 | 0.0 | NaN | 1.0 | 0.0 | NaN |
| 659 | 1 | 8000 | 33000.0 | 38500.0 | HomeImp | Other | 2.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 723 | 1 | 8200 | 47700.0 | 60000.0 | DebtCon | Other | 0.1 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 1493 | 1 | 11100 | NaN | 26400.0 | HomeImp | Other | 8.0 | 0.0 | 0.0 | NaN | 2.0 | 0.0 | NaN |
| 1724 | 0 | 12000 | NaN | 63000.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 1780 | 0 | 12100 | NaN | 72731.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 0.720295 |
| 1855 | 0 | 12400 | NaN | 69350.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 2.365195 |
| 1856 | 1 | 12400 | 94000.0 | 112000.0 | DebtCon | Mgr | 4.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 2078 | 0 | 13100 | NaN | 65933.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 3.657371 |
| 2113 | 0 | 13300 | NaN | 72583.0 | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 3.720421 |
| 2192 | 0 | 13600 | NaN | 71904.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 3.265083 |
| 2203 | 1 | 13600 | 70000.0 | 88000.0 | DebtCon | Mgr | 16.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 2289 | 1 | 13900 | 103030.0 | 114131.0 | DebtCon | Mgr | 3.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 38.394222 |
| 2365 | 0 | 14300 | NaN | 63319.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 3.960859 |
| 2387 | 0 | 14400 | NaN | 69712.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 2.553522 |
| 2617 | 0 | 15000 | NaN | 68020.0 | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 1.920694 |
| 2633 | 0 | 15100 | NaN | 65961.0 | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 1.603508 |
| 2635 | 1 | 15100 | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 2659 | 0 | 15200 | NaN | 67103.0 | NaN | NaN | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 3.688958 |
| 2801 | 1 | 15700 | 83761.0 | 125860.0 | HomeImp | Other | 1.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 2828 | 1 | 15800 | 74815.0 | 89721.0 | DebtCon | Mgr | 16.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 37.984505 |
| 3097 | 1 | 16800 | 87300.0 | 155500.0 | DebtCon | Other | 3.0 | 0.0 | 0.0 | 1154.633333 | 0.0 | 0.0 | NaN |
| 3211 | 1 | 17200 | 50742.0 | 71000.0 | DebtCon | Other | 15.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 3679 | 1 | 19300 | 96454.0 | 157809.0 | DebtCon | Other | 3.0 | 0.0 | 0.0 | 1168.233561 | 0.0 | 0.0 | 40.206138 |
| 3690 | 1 | 19400 | 86219.0 | 126904.0 | HomeImp | Other | 0.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 34.480850 |
| 3941 | 1 | 20600 | 48500.0 | 72000.0 | DebtCon | Other | 17.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 4247 | 1 | 22100 | 57000.0 | 83000.0 | DebtCon | Other | 7.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 4492 | 1 | 23500 | 52457.0 | 77436.0 | DebtCon | Other | 16.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 33.568766 |
| 4781 | 1 | 25000 | 103700.0 | 172762.0 | DebtCon | Other | 4.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 4854 | 1 | 25500 | 63967.0 | 87239.0 | DebtCon | Other | 6.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 33.051074 |
| 4883 | 1 | 25600 | 65030.0 | 92453.0 | DebtCon | Other | 5.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 35.028856 |
| 5124 | 1 | 27500 | 137900.0 | 184000.0 | DebtCon | Office | 10.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 5375 | 1 | 30800 | 147577.0 | 187129.0 | DebtCon | Office | 11.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | 42.362253 |
| 5543 | 0 | 35000 | 31000.0 | 50000.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | NaN |
| 5546 | 0 | 35100 | 33844.0 | 55100.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 34.703024 |
| 5556 | 0 | 35600 | 31807.0 | 51823.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 34.443092 |
| 5566 | 0 | 36100 | 36948.0 | 56236.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 31.058019 |
| 5568 | 0 | 36200 | 38010.0 | 59111.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 31.332141 |
| 5569 | 0 | 36200 | 36974.0 | 54452.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 31.164354 |
| 5570 | 0 | 36200 | 36661.0 | 53622.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 32.014692 |
| 5571 | 0 | 36300 | 35287.0 | 56532.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 32.624532 |
| 5578 | 0 | 36400 | 38263.0 | 57311.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 31.536654 |
| 5587 | 0 | 36600 | 32590.0 | 58157.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 33.016931 |
| 5588 | 0 | 36800 | 35447.0 | 59890.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 31.173844 |
| 5590 | 0 | 36800 | 35077.0 | 53120.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 34.985013 |
| 5595 | 0 | 37000 | 36910.0 | 58338.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 34.891585 |
| 5607 | 0 | 37700 | 37200.0 | 51389.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 32.774331 |
| 5608 | 0 | 37800 | 31214.0 | 58282.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 34.226190 |
| 5635 | 0 | 38900 | 31836.0 | 54976.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 6.0 | 0.0 | 33.459246 |
| 5637 | 0 | 39000 | 36335.0 | 50704.0 | DebtCon | Other | NaN | 0.0 | 0.0 | NaN | 7.0 | 0.0 | 31.244399 |
| 5784 | 1 | 47000 | 159500.0 | 230000.0 | DebtCon | Office | 3.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
| 5804 | 1 | 49500 | 9000.0 | 55000.0 | DebtCon | Other | 3.0 | 0.0 | 0.0 | NaN | 0.0 | 0.0 | NaN |
# reset
pd.set_option('display.max_rows', 20)
histogram_boxplot(df, 'NINQ')
histogram_boxplot(df, 'CLNO', kde=True)
histogram_boxplot(df, 'DEBTINC', kde=True)
Observations:
# Function for getting counts, % and plot of loan default/not by attribute
def status_by_feature(data, feature):
"""
data = dataframe
feature = variable of interest
Plots out how different levels of a feature relate to target
"""
ft_levels=list(data[feature].value_counts().index) # levels of feature
# iterate through levels of feature for count and default rate for each
for i in ft_levels:
print(f'{i} Counts:', data[data[feature]==i].BAD.count()) # counts for attr levels
print(f'{i} Default %:', round(data[data[feature]==i].BAD.mean()*100, 2)) # % of attr levels
print("-" * 50)
# generate chart to display the distribution of levels within the feature
sns.countplot(data, x=feature, hue='BAD', palette='Paired')
plt.ylabel('# of Loans')
plt.show()
status_by_feature(df, 'JOB')
Other Counts: 2388
Other Default %: 23.2
--------------------------------------------------
ProfExe Counts: 1276
ProfExe Default %: 16.61
--------------------------------------------------
Office Counts: 948
Office Default %: 13.19
--------------------------------------------------
Mgr Counts: 767
Mgr Default %: 23.34
--------------------------------------------------
Self Counts: 193
Self Default %: 30.05
--------------------------------------------------
Sales Counts: 109
Sales Default %: 34.86
--------------------------------------------------
status_by_feature(df, 'REASON')
DebtCon Counts: 3928
DebtCon Default %: 18.97
--------------------------------------------------
HomeImp Counts: 1780
HomeImp Default %: 22.25
--------------------------------------------------
# Create function for violin plots split by loan default/not
def violinplot_by_status(data, feature):
"""
data = dataframe
feature = variable of interest
Plots violin plots of variable, separated by loan status (default/paid)
"""
fig, axs = plt.subplots(nrows=2, sharex=True, gridspec_kw = {"height_ratios": (0.35, 0.65)},figsize = (8,8))
sns.violinplot(data=data, x=feature, y='BAD', orient='h', legend=False, ax=axs[0])
sns.histplot(data=data, x=feature, hue='BAD', kde=True, ax=axs[1])
plt.xlabel(feature)
plt.show()
violinplot_by_status(df, 'LOAN')
violinplot_by_status(df, 'MORTDUE')
violinplot_by_status(df, 'VALUE')
violinplot_by_status(df, 'YOJ')
violinplot_by_status(df, 'CLAGE')
violinplot_by_status(df, 'NINQ')
# probably better with count plot instead
violinplot_by_status(df, 'CLNO')
violinplot_by_status(df, 'DEBTINC')
violinplot_by_status(df, 'DELINQ')
violinplot_by_status(df, 'DEROG')
# Now let's look at relationships among the numeric variables
# Creating a set of pairwise scatterplots for numeric variables
col_names = df.drop(columns=['JOB', 'REASON', 'BAD']).columns # list of numeric columns
n_cols = len(col_names)
# create list of paired variables
col_pairs=[]
for i in range(0, n_cols-1):
for j in range(i+1, n_cols): # to only get unique pairs
col_pairs.append([col_names[i], col_names[j]])
n_pairs=len(col_pairs)
# create subplots
# ceiling division (via negated floor division) to get # of rows
fig, axs = plt.subplots(ncols=2, nrows=-(n_pairs//-2), figsize=(10,n_pairs*2))
for i in range(0, n_pairs):
sns.scatterplot(df, x=col_pairs[i][0], y=col_pairs[i][1], marker='+', ax=axs[i//2, i%2])
plt.show()
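The `-(n_pairs // -2)` expression above is ceiling division written as negated floor division, which avoids importing `math.ceil`. A tiny sketch (the helper name `ceil_div` is illustrative):

```python
# Ceiling division via negated floor division: -(a // -b) == ceil(a / b)
def ceil_div(a, b):
    return -(a // -b)

# 7 unique pairs laid out 2 per row need 4 rows; 8 pairs also need 4
print(ceil_div(7, 2), ceil_div(8, 2))  # 4 4
```

This works because Python's `//` floors toward negative infinity, so flooring the negated quotient and negating again rounds up.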
# Checking relationships between categorical and numeric variables (JOB, REASON)
# n_cols, col_names defined above
fig, axs = plt.subplots(nrows=n_cols, figsize = (8, n_cols*6))
for i in range(0, n_cols):
sns.violinplot(df, x=col_names[i], y='JOB', orient='h', legend=False, ax=axs[i])
plt.show()
# And repeating for REASON
fig, axs = plt.subplots(nrows=n_cols, figsize = (8, n_cols*4))
for i in range(0, n_cols):
sns.violinplot(df, x=col_names[i], y='REASON', orient='h', legend=False, ax=axs[i])
plt.show()
Observations:
# convert to category type
cat_col = ['BAD', 'REASON', 'JOB']
for i in cat_col:
df[i] = pd.Categorical(df[i])
# Let's get an overview of how variables might be related
# There are a lot of variables, so I'm splitting the pairplot to make it readable
l=len(col_names)//2
col_set1=col_names[:l] # first half of columns
col_set2=col_names[l:] # second half of columns
sns.pairplot(df, x_vars=col_set1, y_vars=col_names, hue='BAD', corner=True)
# Second half of pairplot
sns.pairplot(df, x_vars=col_set2, y_vars=col_set2, hue='BAD', corner=True)
# checking heatmap of numerical variables
plt.figure(figsize = (16, 12))
sns.heatmap(df.drop(columns=['JOB', 'REASON']).corr(),
annot = True, fmt = '.2f', cmap='magma_r')
plt.show()
Observations:
# backup copy, just in case
df_copy = df.copy()
# Replace outliers beyond whiskers
def rep_outlier_numeric(df, feature):
"""
Identifies and replaces values more than 1.5*IQR below Q1 or above Q3 (the whiskers)
df = dataframe
feature = column to treat
"""
q1 = df[feature].quantile(0.25)
q3 = df[feature].quantile(0.75)
whisker = 1.5 * (q3 - q1)
# define range
lower_bound = q1 - whisker
upper_bound = q3 + whisker
# replace outliers
df[feature] = np.clip(df[feature], lower_bound, upper_bound)
return df
Outliers:
# list of all numeric variables from above: col_names
# iterate through the list to treat outliers in each
col_names = df.drop(columns=['JOB', 'REASON', 'BAD']).columns # list of numeric columns
col_names = col_names.drop(['DELINQ', 'DEROG']) # omit these count variables from treatment
for i in col_names:
rep_outlier_numeric(df, i)
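A quick sanity check of the clipping logic, on a toy column rather than the HMEQ frame (the `toy` name is illustrative):

```python
import numpy as np
import pandas as pd

# toy column with one extreme value
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 100.0]})
q1, q3 = toy["x"].quantile(0.25), toy["x"].quantile(0.75)
lo, hi = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)  # whisker bounds
toy["x"] = np.clip(toy["x"], lo, hi)
# after clipping, every value lies within [lo, hi]; 100.0 became hi = 7.0
print(toy["x"].tolist())
```

`np.clip` caps rather than drops the outliers, so no rows are lost and the sample size stays at 5,960.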
# Visualizing missing data (blank/white = missing)
import missingno as msno
msno.matrix(df)
# Checking out relationships in missingness
msno.heatmap(df)
Missing Data:
# plot missing value count by row
sns.histplot(df.isnull().sum(axis=1), discrete=True)
plt.title('# Missing Values per Row');
# functions to impute median (numeric) and mode (categorical)
# assignment (rather than inplace fillna on a column) avoids pandas chained-assignment issues
def impute_median(df, feature):
    df[feature] = df[feature].fillna(df[feature].median())

def impute_mode(df, feature):
    df[feature] = df[feature].fillna(df[feature].mode()[0])
# replace missing numeric data
col_names = df.drop(columns=['JOB', 'REASON', 'BAD']).columns
for i in col_names:
impute_median(df, i)
# replace missing categorical data
cat_cols=['JOB', 'REASON']
for i in cat_cols:
impute_mode(df, i)
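A minimal check of the same median/mode imputation pattern on a toy frame (names are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({"num": [1.0, None, 3.0], "cat": ["a", None, "a"]})
toy["num"] = toy["num"].fillna(toy["num"].median())   # median of [1, 3] is 2.0
toy["cat"] = toy["cat"].fillna(toy["cat"].mode()[0])  # most frequent level is "a"
print(toy.isna().sum().sum())  # 0 — no missing values remain
```

The same `isna().sum()` check used earlier can confirm the full HMEQ frame is complete after imputation.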
# and check what the data looks like
df.head()
| | BAD | LOAN | MORTDUE | VALUE | REASON | JOB | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1100 | 25860.0 | 39025.0 | HomeImp | Other | 10.5 | 0.0 | 0.0 | 94.366667 | 1.0 | 9.0 | 34.818262 |
| 1 | 1 | 1300 | 70053.0 | 68400.0 | HomeImp | Other | 7.0 | 0.0 | 2.0 | 121.833333 | 0.0 | 14.0 | 34.818262 |
| 2 | 1 | 1500 | 13500.0 | 16700.0 | HomeImp | Other | 4.0 | 0.0 | 0.0 | 149.466667 | 1.0 | 10.0 | 34.818262 |
| 3 | 1 | 1500 | 65019.0 | 89235.5 | DebtCon | Other | 7.0 | 0.0 | 0.0 | 173.466667 | 1.0 | 20.0 | 34.818262 |
| 4 | 0 | 1700 | 97800.0 | 112000.0 | HomeImp | Office | 3.0 | 0.0 | 0.0 | 93.333333 | 0.0 | 14.0 | 34.818262 |
# check summary stats to make sure they are similar to before
# and that missing values have been filled
df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| LOAN | 5960.0 | 18051.895973 | 9252.565294 | 1100.000000 | 11100.000000 | 16300.000000 | 23300.000000 | 41600.000000 |
| MORTDUE | 5960.0 | 70997.067819 | 35597.710401 | 2063.000000 | 48139.000000 | 65019.000000 | 88200.250000 | 159306.000000 |
| VALUE | 5960.0 | 98363.244470 | 44663.105774 | 8000.000000 | 66489.500000 | 89235.500000 | 119004.750000 | 200447.375000 |
| YOJ | 5960.0 | 8.711300 | 7.122031 | 0.000000 | 3.000000 | 7.000000 | 12.000000 | 28.000000 |
| DEROG | 5960.0 | 0.224329 | 0.798458 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| DELINQ | 5960.0 | 0.405705 | 1.079256 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 |
| CLAGE | 5960.0 | 178.368680 | 78.395960 | 0.000000 | 117.371430 | 173.466667 | 227.143058 | 406.230642 |
| NINQ | 5960.0 | 1.085403 | 1.312898 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 5.000000 |
| CLNO | 5960.0 | 20.994379 | 9.245170 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 42.500000 |
| DEBTINC | 5960.0 | 33.923529 | 6.348461 | 14.345367 | 30.763159 | 34.818262 | 37.949892 | 53.797805 |
What are the most important observations and insights from the data based on the EDA performed?
# Data preparation:
# Separating the target variable and other variables
X = df.drop(columns = 'BAD')
Y = df['BAD']
# Feature engineering needed for categorical variables:
# JOB, REASON
X = pd.get_dummies(X, drop_first=True, dtype=int)
# split into train/test, standard 70/30 split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.30, random_state = 101)
## print out size of training/testing data
print("Shape of the training set: ", X_train.shape)
print("Shape of the test set: ", X_test.shape)
# check distribution of target classes between train and test
print("\nPercentage of classes in the training set:")
print(y_train.value_counts(normalize = True))
print("\nPercentage of classes in the test set:")
print(y_test.value_counts(normalize = True))
Shape of the training set:  (4172, 16)
Shape of the test set:  (1788, 16)

Percentage of classes in the training set:
BAD
0    0.803691
1    0.196309
Name: proportion, dtype: float64

Percentage of classes in the test set:
BAD
0    0.793065
1    0.206935
Name: proportion, dtype: float64
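The default-class share differs a little between train (19.6%) and test (20.7%). If exactly matched proportions are desired, `train_test_split` accepts a `stratify` argument; a toy sketch with a synthetic ~20%-positive target (the `X_toy`/`y_toy` names are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = (rng.random(1000) < 0.2).astype(int)  # imbalanced target, ~20% positives
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)
# class proportions now match between splits (up to rounding)
print(round(y_tr.mean(), 2), round(y_te.mean(), 2))
```

Stratification matters most for rare classes; here the unstratified split already came out close.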
# Check test and train stats to ensure similarity
X_train.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| LOAN | 4172.0 | 18064.046021 | 9253.628976 | 1300.000000 | 11100.000000 | 16300.000000 | 23300.000000 | 41600.000000 |
| MORTDUE | 4172.0 | 71340.154700 | 35526.413929 | 2063.000000 | 48593.000000 | 65019.000000 | 88624.000000 | 159306.000000 |
| VALUE | 4172.0 | 98817.951170 | 44577.562739 | 8000.000000 | 66917.250000 | 89235.500000 | 120000.000000 | 200447.375000 |
| YOJ | 4172.0 | 8.687740 | 7.114061 | 0.000000 | 3.000000 | 7.000000 | 12.000000 | 28.000000 |
| DEROG | 4172.0 | 0.210211 | 0.763066 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| DELINQ | 4172.0 | 0.391659 | 1.054273 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 |
| CLAGE | 4172.0 | 178.119118 | 79.070412 | 0.000000 | 116.510638 | 173.466667 | 226.942514 | 406.230642 |
| NINQ | 4172.0 | 1.092042 | 1.312268 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 5.000000 |
| CLNO | 4172.0 | 21.017737 | 9.236538 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 42.500000 |
| DEBTINC | 4172.0 | 33.908495 | 6.353183 | 14.345367 | 30.651564 | 34.818262 | 37.960666 | 53.797805 |
| REASON_HomeImp | 4172.0 | 0.296021 | 0.456555 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| JOB_Office | 4172.0 | 0.159636 | 0.366312 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| JOB_Other | 4172.0 | 0.449185 | 0.497471 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| JOB_ProfExe | 4172.0 | 0.211409 | 0.408357 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| JOB_Sales | 4172.0 | 0.017498 | 0.131132 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| JOB_Self | 4172.0 | 0.029722 | 0.169840 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
X_test.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| LOAN | 1788.0 | 18023.545861 | 9252.609206 | 1100.000000 | 11100.000000 | 16400.000000 | 23300.000000 | 41600.000000 |
| MORTDUE | 1788.0 | 70196.531762 | 35760.676215 | 4447.000000 | 46971.000000 | 65019.000000 | 87829.250000 | 159306.000000 |
| VALUE | 1788.0 | 97302.262170 | 44856.643310 | 9500.000000 | 65737.250000 | 89235.500000 | 117831.000000 | 200447.375000 |
| YOJ | 1788.0 | 8.766275 | 7.142282 | 0.000000 | 3.000000 | 7.000000 | 13.000000 | 28.000000 |
| DEROG | 1788.0 | 0.257271 | 0.874835 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 |
| DELINQ | 1788.0 | 0.438479 | 1.135043 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 10.000000 |
| CLAGE | 1788.0 | 178.950992 | 76.817926 | 0.486711 | 120.236041 | 173.466667 | 227.326848 | 406.230642 |
| NINQ | 1788.0 | 1.069911 | 1.314603 | 0.000000 | 0.000000 | 1.000000 | 2.000000 | 5.000000 |
| CLNO | 1788.0 | 20.939877 | 9.267636 | 0.000000 | 15.000000 | 20.000000 | 26.000000 | 42.500000 |
| DEBTINC | 1788.0 | 33.958609 | 6.339067 | 14.345367 | 31.123635 | 34.818262 | 37.910318 | 53.797805 |
| REASON_HomeImp | 1788.0 | 0.304810 | 0.460456 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| JOB_Office | 1788.0 | 0.157718 | 0.364578 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| JOB_Other | 1788.0 | 0.443512 | 0.496938 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 1.000000 |
| JOB_ProfExe | 1788.0 | 0.220358 | 0.414604 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| JOB_Sales | 1788.0 | 0.020134 | 0.140499 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
| JOB_Self | 1788.0 | 0.038591 | 0.192671 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
Observation: The training set contains the higher-valued outliers for DELINQ and DEROG; otherwise the train and test data are similar.
# Create copies of the train/test sets without DEROG and DELINQ, to compare performance
# when the remaining (untreated) outlier-bearing columns are dropped entirely
# Copying after the split keeps the trimmed sets otherwise identical to the originals
X_train_trim = X_train.copy()
X_train_trim.drop(columns=['DEROG', 'DELINQ'], inplace=True)
X_test_trim = X_test.copy()
X_test_trim.drop(columns=['DEROG', 'DELINQ'], inplace=True)
# Scale data for use in some models
scaler = StandardScaler()
#mmscaler = MinMaxScaler()
# Get list of features
feature_names = X_train.columns
# New copies of data
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
# fit to training data and transform
X_train_scaled = scaler.fit_transform(X_train_scaled)
X_train_scaled = pd.DataFrame(X_train_scaled, columns = feature_names)
# transform test data
X_test_scaled = scaler.transform(X_test_scaled)
X_test_scaled = pd.DataFrame(X_test_scaled, columns = feature_names)
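A quick check of what `StandardScaler` produces, on toy data (the `toy` name is illustrative): each transformed column should have mean ≈ 0 and standard deviation ≈ 1 on the training data it was fitted to.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
toy = rng.normal(loc=5.0, scale=3.0, size=(200, 2))
scaled = StandardScaler().fit_transform(toy)
# per-column mean ~0 and (population) std ~1 after standardization
print(np.allclose(scaled.mean(axis=0), 0.0, atol=1e-9),
      np.allclose(scaled.std(axis=0), 1.0, atol=1e-9))
```

Note the scaler is fitted on the training split only and merely applied to the test split, so no test-set statistics leak into training.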
# Function to print classification report and get confusion matrix in a proper format
def metrics_score(actual, predicted):
print("Accuracy:", round(accuracy_score(actual, predicted), 4))
print('\n')
print(classification_report(actual, predicted))
cm = confusion_matrix(actual, predicted)
plt.figure(figsize = (6, 4))
sns.heatmap(cm, annot = True, fmt = 'd', cmap='Greens',
xticklabels = ['Repaid', 'Defaulted'], yticklabels = ['Repaid', 'Defaulted'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.show()
# create model - using newton-cholesky since the problem is binary, the dataset is not huge, and there are relatively few features
lr = LogisticRegression(solver='newton-cholesky')
# fit to data
lr.fit(X_train, y_train)
# predict on training data
lr_pred_train = lr.predict(X_train)
# predict on test data
lr_pred = lr.predict(X_test)
# Performance on training data
metrics_score(y_train, lr_pred_train)
Accuracy: 0.8401
precision recall f1-score support
0 0.85 0.97 0.91 3353
1 0.72 0.30 0.43 819
accuracy 0.84 4172
macro avg 0.79 0.64 0.67 4172
weighted avg 0.83 0.84 0.81 4172
# Performance on test data
metrics_score(y_test, lr_pred)
Accuracy: 0.8501
precision recall f1-score support
0 0.86 0.97 0.91 1418
1 0.78 0.38 0.51 370
accuracy 0.85 1788
macro avg 0.82 0.68 0.71 1788
weighted avg 0.84 0.85 0.83 1788
# create model for "trimmed" data (no outliers)
lr_trim = LogisticRegression(solver='newton-cholesky')
# fit to data
lr_trim.fit(X_train_trim, y_train)
# predict on training data
lr_pred_train_trim = lr_trim.predict(X_train_trim)
# predict on test data
lr_pred_trim = lr_trim.predict(X_test_trim)
# Performance on training data
metrics_score(y_train, lr_pred_train_trim)
Accuracy: 0.8135
precision recall f1-score support
0 0.82 0.99 0.89 3353
1 0.66 0.10 0.18 819
accuracy 0.81 4172
macro avg 0.74 0.54 0.54 4172
weighted avg 0.79 0.81 0.75 4172
# Performance on test data
metrics_score(y_test, lr_pred_trim)
Accuracy: 0.7981
precision recall f1-score support
0 0.81 0.98 0.89 1418
1 0.58 0.09 0.16 370
accuracy 0.80 1788
macro avg 0.69 0.54 0.52 1788
weighted avg 0.76 0.80 0.73 1788
# Now let's see what happens when we use standardized data
# create model - using newton-cholesky since the problem is binary, the dataset is not huge, and there are relatively few features
lr_scaled = LogisticRegression(solver='newton-cholesky')
# fit to data
lr_scaled.fit(X_train_scaled, y_train)
# predict on training data
lr_scaled_pred_train = lr_scaled.predict(X_train_scaled)
# predict on test data
lr_scaled_pred = lr_scaled.predict(X_test_scaled)
# Performance on training data
metrics_score(y_train, lr_scaled_pred_train)
Accuracy: 0.8401
precision recall f1-score support
0 0.85 0.97 0.91 3353
1 0.72 0.30 0.42 819
accuracy 0.84 4172
macro avg 0.79 0.64 0.67 4172
weighted avg 0.83 0.84 0.81 4172
# Performance on test data
metrics_score(y_test, lr_scaled_pred)
Accuracy: 0.8501
precision recall f1-score support
0 0.86 0.97 0.91 1418
1 0.78 0.38 0.51 370
accuracy 0.85 1788
macro avg 0.82 0.68 0.71 1788
weighted avg 0.84 0.85 0.83 1788
Observations:
- Scaling made no difference for logistic regression here: train (0.8401) and test (0.8501) accuracy are identical to the unscaled model's, and the per-class metrics match as well.
- Trimming outliers actually hurt: test accuracy fell to 0.7981 and recall for defaulters collapsed to 0.09, so the outliers appear to carry real signal about bad loans.
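As a reminder of what standardization does (assuming X_train_scaled above came from StandardScaler), here is a minimal, self-contained sketch with hypothetical values: the scaler learns each column's mean and standard deviation and rescales to mean 0, standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature matrix standing in for X_train (hypothetical values)
X = np.array([[10.0, 200.0],
              [20.0, 400.0],
              [30.0, 600.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # learn column means/stds, then rescale

# each column now has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```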
# Adjust hyperparameters of Logistic Regression
# create new lr
lr_tune = LogisticRegression(class_weight={0: 0.2, 1: 0.8})
# Grid of parameters to choose from
lr_params = {"solver": ['lbfgs', 'liblinear', 'newton-cg', 'newton-cholesky'],
"penalty": [None, 'l1', 'l2'],
"C": [0.1, 1, 10]
}
# Type of scoring used to compare parameter combinations - precision for class 1
scorer = make_scorer(precision_score, pos_label = 1)
# Run the grid search on the full training data (the grid was narrowed down in earlier runs)
grid_obj = GridSearchCV(lr_tune, lr_params, scoring=scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
lr_tune = grid_obj.best_estimator_
# print out model parameters
grid_obj.best_params_
{'C': 1, 'penalty': 'l2', 'solver': 'liblinear'}
# predict on training data
lr_tune_pred_train = lr_tune.predict(X_train)
# predict on test data
lr_tune_pred = lr_tune.predict(X_test)
# Performance on training data
metrics_score(y_train, lr_tune_pred_train)
Accuracy: 0.7519
precision recall f1-score support
0 0.91 0.77 0.83 3353
1 0.42 0.68 0.52 819
accuracy 0.75 4172
macro avg 0.66 0.72 0.68 4172
weighted avg 0.81 0.75 0.77 4172
# Performance on test data
metrics_score(y_test, lr_tune_pred)
Accuracy: 0.7556
precision recall f1-score support
0 0.90 0.77 0.83 1418
1 0.44 0.69 0.54 370
accuracy 0.76 1788
macro avg 0.67 0.73 0.69 1788
weighted avg 0.81 0.76 0.77 1788
Observations:
- Weighting class 1 more heavily trades overall accuracy (0.76 vs 0.85 on test) for much better recall on defaulters (0.69 vs 0.38).
- Train and test metrics are very close, so the tuned model is not overfitting.
# How do the features relate to the best logistic model?
# Using the initial logistic regression model
# get coefficients and convert to odds
odds = np.exp(lr.coef_[0])
# turn into df and sort descending
pd.DataFrame(odds, feature_names, columns = ['odds']).sort_values(by = 'odds', ascending = False)
| | odds |
|---|---|
| JOB_Sales | 2.862375 |
| DELINQ | 2.125800 |
| JOB_Self | 1.845870 |
| DEROG | 1.766820 |
| NINQ | 1.238886 |
| REASON_HomeImp | 1.234934 |
| DEBTINC | 1.079087 |
| JOB_ProfExe | 1.029029 |
| JOB_Other | 1.000896 |
| VALUE | 1.000002 |
| MORTDUE | 0.999996 |
| LOAN | 0.999975 |
| CLAGE | 0.993997 |
| YOJ | 0.991328 |
| CLNO | 0.978654 |
| JOB_Office | 0.485649 |
Observations:
- Sales and self-employed applicants carry the highest odds of default among job types, while office workers have roughly half the baseline odds.
- Each additional delinquent credit line roughly doubles the odds of default (2.13), and each derogatory report multiplies them by about 1.77.
- The monetary variables (VALUE, MORTDUE, LOAN) have odds ratios near 1 per dollar, so their per-unit effect is tiny; CLAGE, YOJ, and CLNO slightly reduce the odds.
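To make the odds interpretation concrete, here is a small self-contained sketch on synthetic (hypothetical) data: exponentiating a logistic regression coefficient gives the multiplicative change in the odds of class 1 per unit increase in that feature, which is exactly what the table above shows.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
# class 1 becomes more likely as feature 0 grows; feature 1 is pure noise
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 0).astype(int)

lr_demo = LogisticRegression().fit(X, y)
odds = np.exp(lr_demo.coef_[0])  # e^coef = odds multiplier per unit increase
print(odds)  # feature 0's odds ratio is well above 1; feature 1's stays near 1
```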
# Using the tuned logistic regression, plotting the precision-recall curve
# get probability of each observation belonging to each class
y_prob_lr = lr_tune.predict_proba(X_train)
precision_lr, recall_lr, threshold_lr = precision_recall_curve(y_train, y_prob_lr[:, 1])
# Plot values of precision, recall, and threshold
plt.figure(figsize = (8, 6))
plt.plot(threshold_lr, precision_lr[:-1], 'b--', label = 'precision')
plt.plot(threshold_lr, recall_lr[:-1], 'g--', label = 'recall')
plt.xlabel('Threshold')
plt.legend(loc = 'upper right')
plt.xlim([0, 1])
plt.ylim([0, 1])
plt.show()
Observation: The precision and recall curves cross at a threshold of ~0.62. Re-score the data with this
optimum = 0.62
metrics_score(y_train, y_prob_lr[:, 1] > optimum)
Accuracy: 0.814
precision recall f1-score support
0 0.88 0.88 0.88 3353
1 0.53 0.53 0.53 819
accuracy 0.81 4172
macro avg 0.71 0.71 0.71 4172
weighted avg 0.81 0.81 0.81 4172
# Get probabilities for test data and score with defined optimum
prob_lr = lr_tune.predict_proba(X_test)
metrics_score(y_test, prob_lr[:, 1] > optimum)
Accuracy: 0.816
precision recall f1-score support
0 0.88 0.89 0.88 1418
1 0.56 0.55 0.55 370
accuracy 0.82 1788
macro avg 0.72 0.72 0.72 1788
weighted avg 0.81 0.82 0.82 1788
# apply optimal threshold to create new prediction columns for test & train data
# for later use in comparing metrics
lr_tune_opt_pred = [1 if i > optimum else 0 for i in prob_lr[:, 1]] # test
lr_tune_opt_pred_tr = [1 if i > optimum else 0 for i in y_prob_lr[:, 1]] # train
Observations:
- At the crossing-point threshold of 0.62, precision and recall for class 1 are balanced (~0.53 on train, ~0.55 on test).
- Train and test metrics remain close, so the thresholded model generalizes well.
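The crossing point read off the plot can also be located programmatically as the threshold where precision and recall are closest; a self-contained sketch of the idea on tiny hypothetical scores (the same logic applies to threshold_lr, precision_lr, and recall_lr above):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# tiny hypothetical labels/scores standing in for y_train and y_prob_lr[:, 1]
y_true  = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 0.7, 0.8, 0.9])

prec, rec, thr = precision_recall_curve(y_true, y_score)
# pick the threshold where |precision - recall| is smallest (the curves' crossing)
ix = np.argmin(np.abs(prec[:-1] - rec[:-1]))
print(thr[ix], prec[ix], rec[ix])
```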
# created and fit decision tree model
# setting class weights of 0.2, 0.8 based on split in target classes (0.8, 0.2)
dtree = DecisionTreeClassifier(random_state = 101, class_weight={0: 0.2, 1: 0.8})
dtree.fit(X_train, y_train)
# predict on training data
dt_pred_tr = dtree.predict(X_train)
# predict on test data
dt_pred = dtree.predict(X_test)
# assess performance on training data
metrics_score(y_train, dt_pred_tr)
Accuracy: 1.0
precision recall f1-score support
0 1.00 1.00 1.00 3353
1 1.00 1.00 1.00 819
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
# assess performance on test data
metrics_score(y_test, dt_pred)
Accuracy: 0.844
precision recall f1-score support
0 0.90 0.90 0.90 1418
1 0.63 0.61 0.62 370
accuracy 0.84 1788
macro avg 0.76 0.76 0.76 1788
weighted avg 0.84 0.84 0.84 1788
Observations:
- The unconstrained tree memorizes the training data (accuracy 1.0) but drops to 0.84 on test - classic overfitting.
- Test performance on class 1 (f1 of 0.62) is nevertheless better than the default logistic regression's (0.51), so a pruned tree looks promising.
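The manual {0: 0.2, 1: 0.8} weighting mirrors the inverse of the class frequencies; scikit-learn can derive equivalent "balanced" weights automatically. A small sketch with a hypothetical 80/20 target like HMEQ's:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# hypothetical target split like HMEQ: ~80% repaid (0), ~20% bad (1)
y_demo = np.array([0] * 80 + [1] * 20)
weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_demo)
print(weights)  # 0.625 and 2.5 -- the same 1:4 ratio as {0: 0.2, 1: 0.8}
```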
# How does a model perform if it's trained on data that is not processed
# Data that has missing data and outliers
# Separating the target variable and other variables
X_raw = df_copy.drop(columns = 'BAD')
Y_raw = df_copy['BAD']
# Feature engineering needed for categorical variables:
# JOB, REASON into binary (not bool)
X_raw = pd.get_dummies(X_raw, drop_first=True, dtype=int)
# split into train/test, standard 70/30 split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_raw, Y_raw, test_size = 0.30, random_state = 101)
## print out size of training/testing data
print("Shape of the training set: ", X_train_r.shape)
print("Shape of the test set: ", X_test_r.shape)
# check distribution of target classes between train and test
print("\nPercentage of classes in the training set:")
print(y_train_r.value_counts(normalize = True))
print("\nPercentage of classes in the test set:")
print(y_test_r.value_counts(normalize = True))
Shape of the training set:  (4172, 16)
Shape of the test set:  (1788, 16)

Percentage of classes in the training set:
BAD
0    0.803691
1    0.196309
Name: proportion, dtype: float64

Percentage of classes in the test set:
BAD
0    0.793065
1    0.206935
Name: proportion, dtype: float64
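For reference, pd.get_dummies with drop_first=True keeps k-1 columns per categorical variable (the dropped level becomes the baseline), and dtype=int yields 0/1 instead of booleans; a tiny sketch with hypothetical rows:

```python
import pandas as pd

# tiny frame (hypothetical rows) with the two categorical columns in HMEQ
demo = pd.DataFrame({"REASON": ["HomeImp", "DebtCon", "HomeImp"],
                     "JOB": ["Office", "Sales", "Office"]})

# drop_first=True drops the alphabetically first level of each variable
encoded = pd.get_dummies(demo, drop_first=True, dtype=int)
print(encoded.columns.tolist())  # ['REASON_HomeImp', 'JOB_Sales']
```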
# create and fit decision tree model
# setting class weights of 0.2, 0.8 based on split in target classes (0.8, 0.2)
dtree_r = DecisionTreeClassifier(random_state = 101, class_weight={0: 0.2, 1: 0.8})
dtree_r.fit(X_train_r, y_train_r)
# predict on training data
dt_pred_r_tr = dtree_r.predict(X_train_r)
# predict on test data
dt_pred_r = dtree_r.predict(X_test_r)
# assess performance on training data
metrics_score(y_train_r, dt_pred_r_tr)
Accuracy: 0.9657
precision recall f1-score support
0 0.99 0.96 0.98 3353
1 0.87 0.97 0.92 819
accuracy 0.97 4172
macro avg 0.93 0.97 0.95 4172
weighted avg 0.97 0.97 0.97 4172
# assess performance on test data
metrics_score(y_test_r, dt_pred_r)
Accuracy: 0.8652
precision recall f1-score support
0 0.92 0.90 0.91 1418
1 0.66 0.72 0.69 370
accuracy 0.87 1788
macro avg 0.79 0.81 0.80 1788
weighted avg 0.87 0.87 0.87 1788
# How does a model perform if it's trained on partially processed data?
# Data that still has missing values, but with outliers treated
# Separating the target variable and other variables
X_nm = df_copy.copy().drop(columns = 'BAD')
Y_nm = df_copy['BAD']
# list of all numeric variables from above: col_names
# iterate through the list to treat outliers in each
col_names = col_names.drop(['DELINQ', 'DEROG'])
for i in col_names:
rep_outlier_numeric(X_nm, i)
# Feature engineering needed for categorical variables:
# JOB, REASON
X_nm = pd.get_dummies(X_nm, drop_first=True, dtype=int)
# split into train/test, standard 70/30 split
X_train_nm, X_test_nm, y_train_nm, y_test_nm = train_test_split(X_nm, Y_nm, test_size = 0.30, random_state = 101)
## print out size of training/testing data
print("Shape of the training set: ", X_train_nm.shape)
print("Shape of the test set: ", X_test_nm.shape)
# check distribution of target classes between train and test
print("\nPercentage of classes in the training set:")
print(y_train_nm.value_counts(normalize = True))
print("\nPercentage of classes in the test set:")
print(y_test_nm.value_counts(normalize = True))
Shape of the training set:  (4172, 16)
Shape of the test set:  (1788, 16)

Percentage of classes in the training set:
BAD
0    0.803691
1    0.196309
Name: proportion, dtype: float64

Percentage of classes in the test set:
BAD
0    0.793065
1    0.206935
Name: proportion, dtype: float64
X_train_nm.head()
| | LOAN | MORTDUE | VALUE | YOJ | DEROG | DELINQ | CLAGE | NINQ | CLNO | DEBTINC | REASON_HomeImp | JOB_Office | JOB_Other | JOB_ProfExe | JOB_Sales | JOB_Self |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2253 | 13800 | 31974.0 | 44417.0 | 7.0 | 0.0 | 0.0 | 76.933123 | 1.0 | 12.0 | 30.422429 | 1 | 0 | 1 | 0 | 0 | 0 |
| 5354 | 30400 | 40386.0 | 68120.0 | 12.0 | 1.0 | 0.0 | 98.820154 | 5.0 | 15.0 | 34.020257 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4342 | 22600 | 63363.0 | 90816.0 | 3.0 | NaN | NaN | 174.326561 | NaN | 15.0 | 38.742482 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3405 | 18000 | 52301.0 | 79959.0 | 1.0 | NaN | 1.0 | 290.280648 | 0.0 | 9.0 | 37.163698 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2112 | 13300 | 64002.0 | 88174.0 | 3.0 | NaN | NaN | 157.916990 | NaN | 34.0 | 29.798231 | 0 | 0 | 1 | 0 | 0 | 0 |
# create and fit decision tree model
# setting class weights of 0.2, 0.8 based on split in target classes (0.8, 0.2)
dtree_nm = DecisionTreeClassifier(random_state = 101, class_weight={0: 0.2, 1: 0.8})
dtree_nm.fit(X_train_nm, y_train_nm)
# predict on training data
dt_pred_nm_tr = dtree_nm.predict(X_train_nm)
# predict on test data
dt_pred_nm = dtree_nm.predict(X_test_nm)
# assess performance on training data
metrics_score(y_train_nm, dt_pred_nm_tr)
# assess performance on test data
metrics_score(y_test_nm, dt_pred_nm)
Accuracy: 0.9643
precision recall f1-score support
0 0.99 0.97 0.98 3353
1 0.87 0.96 0.91 819
accuracy 0.96 4172
macro avg 0.93 0.96 0.95 4172
weighted avg 0.97 0.96 0.96 4172
Accuracy: 0.8602
precision recall f1-score support
0 0.92 0.90 0.91 1418
1 0.65 0.70 0.68 370
accuracy 0.86 1788
macro avg 0.79 0.80 0.79 1788
weighted avg 0.86 0.86 0.86 1788
Observations:
- The trees trained on unprocessed data perform better on test (accuracy ~0.86-0.87) than the tree trained on fully processed data (0.844).
- Decision trees are insensitive to outliers, and recent scikit-learn trees can split on missing values directly, so heavy preprocessing is not required here.
# How does a model perform with scaled data?
# Apply scaling to the raw data from above
dt_scaler = MinMaxScaler()
# New copies of data
X_train_rs = X_train_r.copy()
X_test_rs = X_test_r.copy()
# fit to training data and transform
X_train_rs = dt_scaler.fit_transform(X_train_rs)
X_train_rs = pd.DataFrame(X_train_rs, columns = feature_names)
# transform test data
X_test_rs = dt_scaler.transform(X_test_rs)
X_test_rs = pd.DataFrame(X_test_rs, columns = feature_names)
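Note the fit/transform split above: the scaler's min and max come from the training data only, and the test set is mapped with those same parameters (avoiding leakage), so unseen extreme test values can land outside [0, 1]. A self-contained sketch with hypothetical values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_train_demo = np.array([[0.0], [5.0], [10.0]])   # hypothetical train column
X_test_demo  = np.array([[5.0], [12.0]])          # test contains an unseen maximum

mm = MinMaxScaler()
train_scaled = mm.fit_transform(X_train_demo)  # learns min=0, max=10 from train only
test_scaled  = mm.transform(X_test_demo)       # reuses train's min/max (no leakage)
print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.5 1.2] -- can exceed 1 for unseen extremes
```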
# create and fit decision tree model
# setting class weights of 0.2, 0.8 based on split in target classes (0.8, 0.2)
dtree_rs = DecisionTreeClassifier(random_state = 101, class_weight={0: 0.2, 1: 0.8})
dtree_rs.fit(X_train_rs, y_train_r)
# predict on training data
dt_pred_rs_tr = dtree_rs.predict(X_train_rs)
# predict on test data
dt_pred_rs = dtree_rs.predict(X_test_rs)
# assess performance on training data
metrics_score(y_train_r, dt_pred_rs_tr)
# assess performance on test data
metrics_score(y_test_r, dt_pred_rs)
Accuracy: 0.9657
precision recall f1-score support
0 0.99 0.96 0.98 3353
1 0.87 0.97 0.92 819
accuracy 0.97 4172
macro avg 0.93 0.97 0.95 4172
weighted avg 0.97 0.97 0.97 4172
Accuracy: 0.8647
precision recall f1-score support
0 0.92 0.90 0.91 1418
1 0.66 0.71 0.69 370
accuracy 0.86 1788
macro avg 0.79 0.81 0.80 1788
weighted avg 0.87 0.86 0.87 1788
You can learn more about the available hyperparameters at the link below and try tuning them.
https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html
# new tree, defined weights by target distribution (inverse -> balance)
dt_tune = DecisionTreeClassifier(random_state = 101, class_weight = {0: 0.2, 1: 0.8})
# grid of parameters to choose from
params = {'max_depth': np.arange(5, 10),
'criterion': ['gini', 'entropy'],
'min_samples_leaf': [20, 30, 40],
'class_weight': ["balanced", {0: 0.2, 1: 0.8}]
}
# Type of scoring used to compare parameter combinations
scorer = make_scorer(f1_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(dt_tune, params, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train_r, y_train_r)
# select model with best combination of parameters
dt_tune = grid_obj.best_estimator_
# fit "best" model
dt_tune.fit(X_train_r, y_train_r)
DecisionTreeClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=7, min_samples_leaf=20, random_state=101)
# assess performance on training data
dt_pred_tr_tune_r = dt_tune.predict(X_train_r)
metrics_score(y_train_r, dt_pred_tr_tune_r)
Accuracy: 0.8547
precision recall f1-score support
0 0.96 0.86 0.90 3353
1 0.59 0.85 0.70 819
accuracy 0.85 4172
macro avg 0.77 0.85 0.80 4172
weighted avg 0.89 0.85 0.86 4172
# assess performance on test data
dt_pred_tune_r = dt_tune.predict(X_test_r)
metrics_score(y_test_r, dt_pred_tune_r)
Accuracy: 0.8255
precision recall f1-score support
0 0.94 0.83 0.88 1418
1 0.55 0.80 0.65 370
accuracy 0.83 1788
macro avg 0.75 0.82 0.77 1788
weighted avg 0.86 0.83 0.84 1788
# Time to see what the tree actually looks like
plt.figure(figsize = (30, 20))
tree.plot_tree(dt_tune, feature_names = feature_names, filled = True, fontsize = 9, node_ids = True)
plt.show()
Observation:
# How important are the features?
# get importance of each from model
imp_dtree = dt_tune.feature_importances_
indices = np.argsort(imp_dtree)
plt.figure(figsize = (12, 10))
plt.title('Feature Importances')
# plot bars
plt.barh(range(len(indices)), imp_dtree[indices], align = 'center')
# label bars
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
Random Forest is a bagging algorithm whose base models are decision trees. Bootstrap samples are drawn from the training data, and a decision tree is trained on each sample.
The results from all the decision trees are then combined into a final prediction by voting or averaging.
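In scikit-learn specifically, the combination step is probability averaging: the forest's predicted class probabilities are the mean of the individual trees' probabilities. A small self-contained check on synthetic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=101)
rf = RandomForestClassifier(n_estimators=25, random_state=101).fit(X, y)

# averaging each tree's class probabilities reproduces the forest's output
tree_probs = np.mean([t.predict_proba(X) for t in rf.estimators_], axis=0)
print(np.allclose(tree_probs, rf.predict_proba(X)))  # True
```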
# Build Random Forest classifier - initial parameters chosen so as not to over-constrain the model, based on above results
rfc = RandomForestClassifier(n_estimators = 100, random_state=101, criterion='entropy',
max_depth=9, min_samples_split=20, class_weight={0: 0.2, 1: 0.8})
rfc.fit(X_train, y_train)
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=9, min_samples_split=20, random_state=101)
# assess performance on training data
pred_rfc_tr = rfc.predict(X_train)
metrics_score(y_train, pred_rfc_tr)
Accuracy: 0.9286
precision recall f1-score support
0 0.97 0.94 0.95 3353
1 0.78 0.88 0.83 819
accuracy 0.93 4172
macro avg 0.88 0.91 0.89 4172
weighted avg 0.93 0.93 0.93 4172
# assess performance on test data
pred_rfc = rfc.predict(X_test)
metrics_score(y_test, pred_rfc)
Accuracy: 0.8887
precision recall f1-score support
0 0.95 0.91 0.93 1418
1 0.70 0.81 0.75 370
accuracy 0.89 1788
macro avg 0.82 0.86 0.84 1788
weighted avg 0.90 0.89 0.89 1788
# Adjust hyperparameters of Random Forest Classifier
# create new rfc
rfc_tune = RandomForestClassifier(random_state = 101, criterion='entropy', class_weight={0: 0.2, 1: 0.8})
# Grid of parameters to choose from
rfc_params = {"n_estimators": [80, 100],
"max_depth": np.arange(7, 10),
"max_features": [0.8, 0.9],
"min_samples_split": [10, 20],
#"class_weight": ["balanced",{0: 0.2, 1: 0.8}]
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
#scorer = make_scorer(f1_score, pos_label = 1)
scorer = make_scorer(precision_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(rfc_tune, rfc_params, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the classifier to the best combination of parameters
rfc_tune = grid_obj.best_estimator_
# Print out "best" model
rfc_tune
RandomForestClassifier(class_weight={0: 0.2, 1: 0.8}, criterion='entropy',
                       max_depth=9, max_features=0.9, min_samples_split=10,
                       random_state=101)
# assess performance on training data
pred_rfc_tune_tr = rfc_tune.predict(X_train)
metrics_score(y_train, pred_rfc_tune_tr)
Accuracy: 0.9314
precision recall f1-score support
0 0.97 0.94 0.96 3353
1 0.79 0.89 0.84 819
accuracy 0.93 4172
macro avg 0.88 0.92 0.90 4172
weighted avg 0.94 0.93 0.93 4172
# assess performance on test data
pred_rfc_tune = rfc_tune.predict(X_test)
metrics_score(y_test, pred_rfc_tune)
Accuracy: 0.8758
precision recall f1-score support
0 0.94 0.90 0.92 1418
1 0.67 0.77 0.72 370
accuracy 0.88 1788
macro avg 0.81 0.84 0.82 1788
weighted avg 0.88 0.88 0.88 1788
# Is there a way to visualize a random forest classifier?
# get importance of each from model
imp_rfc = rfc_tune.feature_importances_
indices = np.argsort(imp_rfc)
plt.figure(figsize = (12, 10))
plt.title('Feature Importances')
# plot bars
plt.barh(range(len(indices)), imp_rfc[indices], align = 'center')
# label bars
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel('Relative Importance')
plt.show()
# instantiate SVM model
s=SVC(random_state = 101, class_weight={0: 0.2, 1: 0.8})
# fit to data
s.fit(X_train, y_train)
# predict on training and test data
pred_svm_tr=s.predict(X_train)
pred_svm=s.predict(X_test)
# assess performance on training data
metrics_score(y_train, pred_svm_tr)
Accuracy: 0.6493
precision recall f1-score support
0 0.83 0.71 0.76 3353
1 0.26 0.42 0.32 819
accuracy 0.65 4172
macro avg 0.54 0.56 0.54 4172
weighted avg 0.72 0.65 0.68 4172
# assess performance on test data
metrics_score(y_test, pred_svm)
Accuracy: 0.6314
precision recall f1-score support
0 0.82 0.69 0.75 1418
1 0.26 0.41 0.31 370
accuracy 0.63 1788
macro avg 0.54 0.55 0.53 1788
weighted avg 0.70 0.63 0.66 1788
# Let's repeat, using scaled data to see if that improves model
# instantiate SVM model
s_sc=SVC(random_state = 101, class_weight={0: 0.2, 1: 0.8})
# fit to data
s_sc.fit(X_train_scaled, y_train)
# predict on training and test data
pred_svm_sc_tr=s_sc.predict(X_train_scaled)
pred_svm_sc=s_sc.predict(X_test_scaled)
# assess performance on training data
metrics_score(y_train, pred_svm_sc_tr)
Accuracy: 0.8221
precision recall f1-score support
0 0.94 0.83 0.88 3353
1 0.53 0.80 0.64 819
accuracy 0.82 4172
macro avg 0.74 0.81 0.76 4172
weighted avg 0.86 0.82 0.83 4172
# assess performance on test data
metrics_score(y_test, pred_svm_sc)
Accuracy: 0.8143
precision recall f1-score support
0 0.93 0.83 0.88 1418
1 0.54 0.76 0.63 370
accuracy 0.81 1788
macro avg 0.73 0.79 0.75 1788
weighted avg 0.85 0.81 0.82 1788
%%time
# Now let's try to tune the SVM for better performance
# create new svm
svm_tune = SVC(random_state = 101, class_weight={0: 0.2, 1: 0.8})
# Grid of parameters to choose from
svm_params = {"kernel": ['poly', 'rbf', 'sigmoid'],
"C": [0.1, 1, 10, 100],
"gamma": [1, 0.1, 0.01, 'scale']
}
# Type of scoring used to compare parameter combinations - f1 score for class 1
scorer = make_scorer(f1_score, pos_label = 1)
# Run the grid search
grid_obj = GridSearchCV(svm_tune, svm_params, scoring = scorer, cv = 5)
grid_obj = grid_obj.fit(X_train_scaled, y_train)  # fit on scaled data, matching the predictions below
# Set the classifier to the best combination of parameters
svm_tune = grid_obj.best_estimator_
# display chosen parameters
grid_obj.best_params_
# predict and evaluate training data
pred_svm_tune_tr=svm_tune.predict(X_train_scaled)
metrics_score(y_train, pred_svm_tune_tr)
Accuracy: 0.9432
precision recall f1-score support
0 0.98 0.95 0.96 3353
1 0.81 0.94 0.87 819
accuracy 0.94 4172
macro avg 0.89 0.94 0.91 4172
weighted avg 0.95 0.94 0.94 4172
# predict and evaluate test data
pred_svm_tune=svm_tune.predict(X_test_scaled)
metrics_score(y_test, pred_svm_tune)
Accuracy: 0.8853
precision recall f1-score support
0 0.93 0.92 0.93 1418
1 0.71 0.75 0.73 370
accuracy 0.89 1788
macro avg 0.82 0.83 0.83 1788
weighted avg 0.89 0.89 0.89 1788
Observations:
The tuned SVC achieves higher accuracy and class-1 precision than the tuned decision tree or random forest, though slightly lower recall. However, it is harder to interpret, and thus harder to use to justify denying a loan.
# instantiate model, weight = #neg/#pos
xgb = XGBClassifier(scale_pos_weight=4)
# define eval set for model to use
# using train and test data to plot metrics for both
# eval_set = [(X_test, y_test)]
eval_set = [(X_train, y_train), (X_test, y_test)]
# Fit the model; 2 metrics for plotting
xgb.fit(X_train, y_train, eval_set=eval_set, eval_metric=["error", "logloss"], verbose=False)
# retrieve performance metrics
results = xgb.evals_result()
epochs = len(results['validation_0']['error'])
x_axis = range(0, epochs)
# predict on training data and evaluate
xgb_pred_train = xgb.predict(X_train)
metrics_score(y_train, xgb_pred_train)
# predict on test data and evaluate
xgb_pred = xgb.predict(X_test)
metrics_score(y_test, xgb_pred)
Accuracy: 0.9995
precision recall f1-score support
0 1.00 1.00 1.00 3353
1 1.00 1.00 1.00 819
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Accuracy: 0.9195
precision recall f1-score support
0 0.95 0.95 0.95 1418
1 0.81 0.80 0.80 370
accuracy 0.92 1788
macro avg 0.88 0.87 0.88 1788
weighted avg 0.92 0.92 0.92 1788
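The scale_pos_weight used above follows the common #negative/#positive rule of thumb; with the training split reported earlier (3,353 good vs 819 bad loans) that ratio is about 4. A quick sketch:

```python
import numpy as np

# hypothetical target with the class counts reported for the training split
y_demo = np.array([0] * 3353 + [1] * 819)
neg, pos = np.bincount(y_demo)
print(round(neg / pos))  # ~4, the scale_pos_weight used above
```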
# plot log loss
plt.figure(figsize = (6, 4))
plt.plot(x_axis, results['validation_0']['logloss'], 'b--', label = 'Train')
plt.plot(x_axis, results['validation_1']['logloss'], 'g--', label = 'Test')
plt.xlabel('Epoch')
plt.ylabel('Log Loss')
plt.title('XGBoost Log Loss')
plt.legend(loc = 'upper right')
plt.show()
# plot classification error
plt.figure(figsize = (6, 4))
plt.plot(x_axis, results['validation_0']['error'], 'b--', label = 'Train')
plt.plot(x_axis, results['validation_1']['error'], 'g--', label = 'Test')
plt.xlabel('Epoch')
plt.ylabel('Classification Error')
plt.title('XGBoost Classification Error')
plt.legend(loc = 'upper right')
plt.show()
Observations:
- XGBoost nearly memorizes the training set (accuracy 0.9995) yet still posts the best test accuracy so far (0.92, with an f1 of 0.80 for class 1).
- The gap between train and test performance suggests some overfitting; early stopping or stronger regularization could help.
%%time
# let's try to tune it to better fit test data
# eval_set and verbose are fit-time arguments rather than constructor arguments, so they are omitted here
estimator = XGBClassifier(objective='binary:logistic',
nthread=4,
seed=101,
n_estimators=100, # 100 trees per model
eval_metric='logloss',
eta=0.1 # smaller step size
)
# parameters to adjust
xgb_params = {
'max_depth': range (6, 9),
'colsample_bylevel': [0.6, 0.7, 0.8],
'colsample_bytree': [0.6, 0.7, 0.8],
'subsample': [0.6, 0.7, 0.8]
}
# apply parameters and model to grid search
grid_obj = GridSearchCV(estimator=estimator, param_grid=xgb_params, scoring = scorer, cv=5)
# Using "raw" unprocessed data, but results are almost identical to processed data
# fit to data and search
grid_obj = grid_obj.fit(X_train_r, y_train_r)
# pick best model
xgb_tune = grid_obj.best_estimator_
# print out parameters
grid_obj.best_params_
{'colsample_bylevel': 0.8,
'colsample_bytree': 0.8,
'max_depth': 8,
'subsample': 0.8}
# predict on training data and evaluate
xgb_tune_pred_train = xgb_tune.predict(X_train_r)
metrics_score(y_train_r, xgb_tune_pred_train)
Accuracy: 0.9921
precision recall f1-score support
0 0.99 1.00 1.00 3353
1 1.00 0.96 0.98 819
accuracy 0.99 4172
macro avg 0.99 0.98 0.99 4172
weighted avg 0.99 0.99 0.99 4172
# predict on test data and evaluate
xgb_tune_pred = xgb_tune.predict(X_test_r)
metrics_score(y_test_r, xgb_tune_pred)
Accuracy: 0.915
precision recall f1-score support
0 0.94 0.96 0.95 1418
1 0.82 0.75 0.78 370
accuracy 0.91 1788
macro avg 0.88 0.85 0.87 1788
weighted avg 0.91 0.91 0.91 1788
# Build model with optimized parameters and adding early stopping
# instantiate model
xgb2 = XGBClassifier(scale_pos_weight=4,
objective='binary:logistic',
n_estimators=100, # 100 trees
seed=101,
nthread=4,
eta=0.15, # learning rate (step-size shrinkage)
colsample_bytree=0.8, # portion of features available for each tree
colsample_bylevel=0.8, # portion of features available at each level
max_depth=8,
subsample=0.8, # sample using 80% of training data
early_stopping_rounds=5 # stop after 5 rounds of no improvement in eval metric
)
# define eval set for model to use
eval_set = [(X_test_r, y_test_r)]
# Fit the model; 2 metrics for plotting
xgb2.fit(X_train_r, y_train_r, eval_metric="logloss", eval_set=eval_set, verbose=False)
# predict on training data and evaluate
xgb_pred2_train_r = xgb2.predict(X_train_r)
metrics_score(y_train_r, xgb_pred2_train_r)
# predict on test data and evaluate
xgb_pred2_r = xgb2.predict(X_test_r)
metrics_score(y_test_r, xgb_pred2_r)
Accuracy: 0.9981
precision recall f1-score support
0 1.00 1.00 1.00 3353
1 0.99 1.00 1.00 819
accuracy 1.00 4172
macro avg 1.00 1.00 1.00 4172
weighted avg 1.00 1.00 1.00 4172
Accuracy: 0.9183
precision recall f1-score support
0 0.95 0.94 0.95 1418
1 0.79 0.83 0.81 370
accuracy 0.92 1788
macro avg 0.87 0.88 0.88 1788
weighted avg 0.92 0.92 0.92 1788
How does each of the tuned models perform?
# define lists for each metric
train_acc, test_acc, train_prc, test_prc, train_recall, test_recall, train_f1, test_f1 = [], [], [], [], [], [], [], []
# list to collect model names
model_names = []
# helper to collect metrics for each classifier, one by one
def metrics_table(model_name, y_train_predict, y_test_predict, y_train=y_train, y_test=y_test):
"""
Takes in actual and predicted y values
Calculates performance metrics for each
Combines data into one table for comparison
model_name: string, name to display for model
"""
train_acc.append(accuracy_score(y_true=y_train, y_pred=y_train_predict))
test_acc.append(accuracy_score(y_true=y_test, y_pred=y_test_predict))
train_prc.append(precision_score(y_true=y_train, y_pred=y_train_predict))
test_prc.append(precision_score(y_true=y_test, y_pred=y_test_predict))
train_recall.append(recall_score(y_true=y_train, y_pred=y_train_predict))
test_recall.append(recall_score(y_true=y_test, y_pred=y_test_predict))
train_f1.append(f1_score(y_true=y_train, y_pred=y_train_predict))
test_f1.append(f1_score(y_true=y_test, y_pred=y_test_predict))
model_names.append(model_name)
metrics_table('Logistic', lr_tune_opt_pred_tr, lr_tune_opt_pred)
metrics_table('D Tree', dt_pred_tr_tune_r, dt_pred_tune_r, y_train_r, y_test_r)
metrics_table('Random Forest', pred_rfc_tune_tr, pred_rfc_tune)
metrics_table('SVC', pred_svm_tune_tr, pred_svm_tune)
metrics_table('XGBoost', xgb_pred2_train_r, xgb_pred2_r, y_train_r, y_test_r)
# Aggregating information and displaying the results as a dataframe
train_results = [train_acc, test_acc, train_prc, test_prc,
train_recall, test_recall, train_f1, test_f1]
metrics = ['Accuracy (Train)', 'Accuracy (Test)', 'Precision (Train)', 'Precision (Test)',
'Recall (Train)', 'Recall (Test)', 'F1_Score (Train)', 'F1_Score (Test)']
metrics_df = pd.DataFrame(data=train_results, columns=model_names, index=metrics)
metrics_df.T
| | Accuracy (Train) | Accuracy (Test) | Precision (Train) | Precision (Test) | Recall (Train) | Recall (Test) | F1_Score (Train) | F1_Score (Test) |
|---|---|---|---|---|---|---|---|---|
| Logistic | 0.813998 | 0.815996 | 0.526124 | 0.556474 | 0.528694 | 0.545946 | 0.527406 | 0.551160 |
| D Tree | 0.855944 | 0.828859 | 0.589638 | 0.558394 | 0.875458 | 0.827027 | 0.704668 | 0.666667 |
| Random Forest | 0.948706 | 0.902685 | 0.843360 | 0.762032 | 0.907204 | 0.770270 | 0.874118 | 0.766129 |
| SVC | 0.943193 | 0.885347 | 0.806316 | 0.712082 | 0.935287 | 0.748649 | 0.866026 | 0.729908 |
| XGBoost | 0.998082 | 0.918345 | 0.990326 | 0.788660 | 1.000000 | 0.827027 | 0.995140 | 0.807388 |
# Would using a different scaling method work better? (e.g., MinMaxScaler)
# Need to look at SVC using scaled data
# Would DTree or RFC perform differently with outliers included?
1. Comparison of various techniques and their relative performance based on chosen Metric (Measure of success):
2. Refined insights:
3. Proposal for the final solution design: